feat: benchmark-agnostic evaluation protocol (BenchmarkProtocol + registry) (#107) #129

Merged: cagataycali merged 1 commit into strands-labs:main from yinsong1986:feat/benchmark-protocol on May 13, 2026

Conversation

@yinsong1986 (Contributor)

Summary

Implements #107: a benchmark-agnostic evaluation protocol so LIBERO, Meta-World,
RoboSuite, ManiSkill, and user-authored tasks can plug into the SimEngine eval
loop without committing the core to any single benchmark's conventions.

Adapters stay thin and ship in follow-up extras (#108 Meta-World, #109 RoboSuite,
#110 LIBERO). The existing success_fn path is kept working for backcompat.

What's in scope (from #107)

  1. BenchmarkProtocol ABC + StepInfo dataclass in strands_robots/simulation/benchmark.py (see the sketch after this list)
  2. PolicyRunner.evaluate widened to accept spec: BenchmarkProtocol + seed, alongside the existing success_fn path
  3. Tool actions + tool_spec.json entries: list_benchmarks, register_benchmark_from_file, evaluate_benchmark
  4. Named-predicate library in strands_robots/simulation/predicates.py (11 built-ins: body_above_z, joint_above, distance_less_than, inside_region, contact_between, contact_any, distance_neg, joint_progress, constant, …)
  5. Declarative YAML/JSON loader (register_benchmark_from_file) restricted to the named-predicate DSL - no eval, no exec, safe for LLM-authored specs
  6. Reference adapter: DeclarativeBenchmark (the DSL-driven adapter) serves as the end-to-end reference. A full MetaWorldAdapter is tracked as a separate PR (#108), since it needs the metaworld env and real task validation.
  7. Tests: 110 new tests covering protocol contract, cumulative reward, per-episode seed reproducibility, DSL compile + validation, backcompat, and MuJoCo dispatch
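For orientation, here is a minimal sketch of the protocol surface implied by items 1-2 and the architecture notes below. The method names (on_episode_start, is_success, is_failure, dense_reward) and metadata fields (supported_robots, default_robot, max_steps) come from this PR description; StepInfo's exact fields and all signatures are assumptions — the authoritative definitions live in strands_robots/simulation/benchmark.py.

```python
# Illustrative sketch only; anything not named in the PR text (e.g. StepInfo's
# fields, exact signatures) is an assumption, not the shipped API.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any

import numpy as np


@dataclass
class StepInfo:
    """Hypothetical per-step record returned to the eval loop."""
    reward: float = 0.0
    is_success: bool = False
    is_failure: bool = False
    done: bool = False
    extras: dict[str, Any] = field(default_factory=dict)


class BenchmarkProtocol(ABC):
    """Contract a benchmark adapter implements to plug into the SimEngine eval loop."""

    name: str
    max_steps: int = 300
    supported_robots: tuple[str, ...] = ()
    default_robot: str | None = None

    @abstractmethod
    def on_episode_start(self, sim: Any, rng: np.random.Generator) -> None:
        """Reset/randomize the scene; the default validates robot compatibility."""

    @abstractmethod
    def is_success(self, sim: Any) -> bool:
        """Episode-level success condition."""

    def is_failure(self, sim: Any) -> bool:
        """Optional early-termination condition."""
        return False

    def dense_reward(self, sim: Any) -> float:
        """Optional shaping term accumulated by the eval loop."""
        return 0.0
```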

Architecture notes

  • Robot compatibility is first-class metadata. Default on_episode_start validates loaded robot data_config against supported_robots and auto-loads default_robot when the sim is empty. Mismatches surface as structured error dicts.
  • Registry mirrors register_urdf. Module-level dict[str, BenchmarkProtocol] guarded by an RLock. Re-registration is idempotent-overwrite with a warning.
  • Predicate library is a closed registry. YAML/JSON specs can only reference predicates in PREDICATE_REGISTRY. No eval. Sandboxed by construction.
  • Seed reproducibility. evaluate(spec=…, seed=42) seeds a master RNG, derives a per-episode child RNG, and threads it through spec.on_episode_start(sim, rng) (sketched after this list).
  • Signature-driven dispatch stays intact. No router code changes; just three new enum entries and two new top-level properties (benchmark_name, spec_path).
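A sketch of the seeding flow from the seed-reproducibility bullet above, assuming numpy Generators and illustrative sim/policy methods (sim.step, policy.act); the exact child-seed derivation inside PolicyRunner.evaluate may differ.

```python
# Sketch of per-episode RNG derivation; sim.step / policy.act / the 2**32 child
# seed scheme are assumptions used only to make the flow concrete.
import numpy as np


def evaluate_with_seed(sim, spec, policy, n_episodes: int, seed: int) -> list[bool]:
    master = np.random.default_rng(seed)  # one master RNG per evaluate() call
    successes = []
    for _ in range(n_episodes):
        # Each episode gets its own child RNG, so episode k is reproducible
        # no matter how much randomness earlier episodes consumed.
        child = np.random.default_rng(master.integers(2**32))
        spec.on_episode_start(sim, child)
        steps, done = 0, False
        while not done and steps < spec.max_steps:
            sim.step(policy.act(sim))
            done = spec.is_success(sim) or spec.is_failure(sim)
            steps += 1
        successes.append(spec.is_success(sim))
    return successes
```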

Verification

```bash
hatch run lint  # ruff + mypy, all clean
hatch run pytest tests/simulation/test_benchmark.py \
  tests/simulation/test_benchmark_predicates.py \
  tests/simulation/test_benchmark_dsl.py \
  tests/simulation/test_policy_runner_benchmark.py \
  tests/simulation/mujoco/test_benchmark_dispatch.py
# → 110 passed
```

Full suite: 1347 passed, 31 skipped. The 3 pre-existing failures all require OpenGL/OSMesa and fail on `main` too.

Example

Python:

```python
sim.register_benchmark_from_file(benchmark_name='drawer-open', spec_path='specs/drawer.yaml')
sim.evaluate_benchmark(benchmark_name='drawer-open', robot_name='arm', policy_provider='mock', n_episodes=10, seed=42)
```
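The same operations back the list_benchmarks, register_benchmark_from_file, and evaluate_benchmark tool actions, so the flow above is also reachable through the tool_spec.json surface rather than direct Python calls.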

Spec (`specs/drawer.yaml`):

```yaml
name: drawer-open
max_steps: 300
supported_robots: [panda]
default_robot: panda
success:
  all:
    - {predicate: joint_above, joint: drawer_slide, value: 0.15}
failure:
  any:
    - {predicate: body_below_z, body: gripper, z: -0.1}
dense_reward:
  - {predicate: distance_neg, body_a: gripper, body_b: drawer_handle, weight: 1.0}
  - {predicate: joint_progress, joint: drawer_slide, target: 0.2, weight: 5.0}
```
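To make the DSL semantics concrete, here is a hedged sketch of how clauses like those above could be compiled through a closed predicate registry. The registry contents, factory signatures, and sim accessors (sim.joint_pos, sim.distance) are stand-ins; the real loader lives in predicates.py / benchmark_spec.py and only shares the lookup-by-name, no-eval idea.

```python
# Stand-in for PREDICATE_REGISTRY: predicate name -> factory taking the clause's
# keyword arguments and returning a callable over the sim. Names and accessors
# here are illustrative, not the shipped library.
from typing import Any, Callable

SimFn = Callable[[Any], float]

PREDICATES: dict[str, Callable[..., SimFn]] = {
    "joint_above": lambda joint, value: lambda sim: float(sim.joint_pos(joint) > value),
    "distance_neg": lambda body_a, body_b: lambda sim: -sim.distance(body_a, body_b),
}


def compile_clause(clause: dict) -> SimFn:
    """Resolve the predicate strictly by name; unknown names fail, nothing is eval'd."""
    kwargs = {k: v for k, v in clause.items() if k not in ("predicate", "weight")}
    return PREDICATES[clause["predicate"]](**kwargs)


def compile_all(clauses: list[dict]) -> Callable[[Any], bool]:
    fns = [compile_clause(c) for c in clauses]
    return lambda sim: all(fn(sim) for fn in fns)  # `all:` block


def compile_reward(clauses: list[dict]) -> SimFn:
    terms = [(compile_clause(c), c.get("weight", 1.0)) for c in clauses]
    return lambda sim: sum(w * fn(sim) for fn, w in terms)  # weighted dense reward
```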

Out of scope (tracked as follow-ups)

  • Meta-World / RoboSuite / LIBERO adapters (#108, #109, #110)
  • BDDL parser, dense-reward curriculum tooling, RL training harness
  • Replacing the existing success_fn path (kept working)

Closes #107.

Commit message:

Introduce BenchmarkProtocol ABC + string-keyed registry so every standard
benchmark (LIBERO, Meta-World, RoboSuite, ManiSkill, user-authored tasks)
can plug into the SimEngine eval loop without committing the core to any
single benchmark's conventions. Adapters stay thin and ship in follow-up
extras (strands-labs#108 Meta-World, strands-labs#109 RoboSuite, strands-labs#110 LIBERO).

What lands here
- strands_robots/simulation/benchmark.py: BenchmarkProtocol ABC + StepInfo
  dataclass + thread-safe string-keyed registry (sketched after this list). Robot compatibility is
  first-class metadata; default on_episode_start auto-loads default_robot
  and validates loaded robots against supported_robots.
- strands_robots/simulation/predicates.py: named-predicate library
  (body_above_z, joint_above, distance_less_than, inside_region,
  contact_between, contact_any, distance_neg, joint_progress, constant, ...).
  Closed registry, no eval() - safe for untrusted / LLM-authored specs.
- strands_robots/simulation/benchmark_spec.py: DeclarativeBenchmark +
  register_benchmark_from_file loading YAML/JSON specs (JSON via stdlib,
  YAML gated behind require_optional('pyyaml')).
- PolicyRunner.evaluate now accepts spec=BenchmarkProtocol and seed=int
  alongside the legacy success_fn path. Spec path adds cumulative reward,
  per-episode seeded RNG, is_failure early termination, and structured
  compatibility errors. Legacy path unchanged for backcompat.
- SimEngine base adds evaluate_benchmark / list_benchmarks /
  register_benchmark_from_file facades, auto-dispatched via the existing
  _dispatch_action path. MuJoCo tool_spec.json gains three action enum
  entries plus benchmark_name / spec_path properties.
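The registry shape referenced above (mirroring the register_urdf pattern) might look roughly like this; identifier names and the warning text are assumptions.

```python
# Sketch of a register_urdf-style benchmark registry: module-level dict behind an
# RLock, idempotent overwrite with a warning. Names are illustrative only.
import threading
import warnings

_BENCHMARKS: dict[str, "BenchmarkProtocol"] = {}
_LOCK = threading.RLock()


def register_benchmark(spec: "BenchmarkProtocol") -> None:
    with _LOCK:
        if spec.name in _BENCHMARKS:
            warnings.warn(f"benchmark '{spec.name}' re-registered; overwriting", stacklevel=2)
        _BENCHMARKS[spec.name] = spec  # overwrite is idempotent


def list_benchmarks() -> list[str]:
    with _LOCK:
        return sorted(_BENCHMARKS)


def get_benchmark(name: str) -> "BenchmarkProtocol":
    with _LOCK:
        return _BENCHMARKS[name]  # KeyError for unknown names
```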

Test coverage (110 new tests, all passing)
- Protocol contract, registry ops, thread-safety, compatibility errors
- Every predicate against lightweight fake sims + the reward math
- DSL validation (good/bad specs), file loading (JSON/YAML), sandboxing
- PolicyRunner.evaluate(spec=...) for cumulative reward, seed
  reproducibility, is_success/is_failure/done terminations, legacy
  backcompat, evaluate_benchmark / list_benchmarks / register_... facades
- MuJoCo dispatch path for all three new actions

Out of scope (tracked as follow-ups)
- Meta-World / RoboSuite / LIBERO adapters (strands-labs#108, strands-labs#109, strands-labs#110)
- BDDL parser, dense-reward curriculum tooling, RL training harness
- Replacing the existing success_fn path (kept working)

Refs strands-labs#107.

yinsong1986 commented May 11, 2026

Requesting review from @sundargthb, @awsarron, @cagataycali 🙏

@cagataycali (Member) left a comment:

LGTM. Reviewed the architecture, ABC contract, registry, predicates library, DSL loader, and test coverage.

What I like:

  • Clean separation: BenchmarkProtocol ABC + StepInfo dataclass keep the eval loop protocol-agnostic
  • Closed predicate registry (no eval/exec) makes YAML specs safe for LLM-authored input
  • Thread-safe registry mirroring register_urdf pattern — consistent with existing architecture
  • on_episode_start default validates robot compatibility + auto-loads default_robot — sensible convention
  • 110 tests covering contract, thread safety, DSL compilation, seed reproducibility, and MuJoCo dispatch
  • Backcompat preserved: existing success_fn path unchanged

Architecture notes:

  • The PolicyRunner.evaluate(spec=...) widening keeps the existing surface area intact while enabling the new benchmark-driven path — nice dual-mode approach
  • Predicate factories returning (SimEngine) -> bool|float are the right abstraction level for backend-agnostic evaluation
  • The per-episode rng threading through on_episode_start(sim, rng) ensures reproducibility without mutable state on benchmark instances

This is a solid foundation for #110 (LIBERO), #108 (Meta-World), and #109 (RoboSuite) to build on.

@cagataycali cagataycali merged commit 572c155 into strands-labs:main May 13, 2026
3 checks passed